Directory fmnist_data is created!
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to fmnist_data/FashionMNIST/raw/train-images-idx3-ubyte.gz
Extracting fmnist_data/FashionMNIST/raw/train-images-idx3-ubyte.gz to fmnist_data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to fmnist_data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
Extracting fmnist_data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to fmnist_data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to fmnist_data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
Extracting fmnist_data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to fmnist_data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to fmnist_data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
Regularization techniques in neural networks. Regularization is a set of techniques that can prevent overfitting in neural networks and thus improve their accuracy on unseen data.
But first, let's train our model without any regularization, to see the unregularized result.
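A plain fully connected baseline could look like the following sketch (the hidden size of 256, `input_shape = 28 * 28`, and `num_classes = 10` are assumptions matching FashionMNIST, not the notebook's exact architecture):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
input_shape, num_classes = 28 * 28, 10

# No normalization, no dropout, no weight decay: the unregularized baseline
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(input_shape, 256),
    nn.ReLU(),
    nn.Linear(256, num_classes),
).to(device)

opt = torch.optim.Adam(model.parameters(), lr=3e-4)
loss_func = nn.CrossEntropyLoss()

# One training step on a dummy batch, just to show the loop body
x = torch.randn(32, 1, 28, 28, device=device)
y = torch.randint(0, num_classes, (32,), device=device)
loss = loss_func(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```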
Before talking about batch normalization we should discuss so-called covariate shift. Covariate shift in machine learning is a type of model drift which occurs when the distribution of the input (independent) variables changes between the training environment and the live environment: the distribution of variables in the training data differs from that of real-world or test data.
An example:
Covariate shift can cause serious issues for speech recognition models because of the diversity of voices, dialects and accents in spoken language. For example, a model may be trained on English speakers from a specific area with a specific accent. Although the model may achieve a high degree of accuracy on the training data, it will become inaccurate when processing spoken language in a live environment, because speech with new dialects or accents has a different input distribution than the training data.
Where to insert Batch Norm?
Anywhere, but the authors suggest inserting it before the non-linearity.
But when we have skip connections, it is better to apply BN before the layer (pre-activation style).
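Both placements can be sketched as follows (the layer widths are arbitrary; this is an illustration, not the notebook's model):

```python
import torch
import torch.nn as nn

# Plain stack: Linear -> BatchNorm -> non-linearity, as the BN authors suggest
block = nn.Sequential(
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
)

# With a skip connection: BN placed before the layer (pre-activation residual block)
class PreActResidual(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        # normalize first, then activate and transform, then add the identity shortcut
        return x + self.fc(torch.relu(self.bn(x)))
```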
We can see that for our toy example, layer normalization provided the best performance.
REGULARIZATION
DROPOUT
p is the probability of turning a neuron off.
Large models are prone to overfitting.
With unlimited computational resources, the most effective method to “regularize” a fixed-sized model is to average the predictions of all possible parameter settings, weighting each setting based on its posterior probability given the training data.
Training numerous architectures is challenging due to the complexity of finding optimal hyperparameters for each one and the substantial computational resources required to train large networks. Additionally, large networks typically necessitate abundant training data, which may not always be available to train various networks on different data subsets. Even if multiple large networks could be trained, using them all at test time is impractical in scenarios where rapid response is crucial.
Dropout, a technique that addresses these challenges, prevents overfitting and offers an efficient way to combine a vast number of neural network architectures.
Dropout acts differently during the training and evaluation phases: neurons are dropped only at training time, while at evaluation time all neurons are kept.
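The train/eval difference can be observed directly. `nn.Dropout` uses inverted dropout: surviving activations are scaled by 1/(1-p) during training, and the layer is an identity in eval mode.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()         # training phase: neurons are zeroed at random
train_out = drop(x)  # survivors are scaled by 1 / (1 - p) = 2.0

drop.eval()          # evaluation phase: dropout is a no-op
eval_out = drop(x)
```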
Fill in the model with dropout
model_do = nn.Sequential(
    nn.Flatten(),
    # insert the dropout regularization into the net
    nn.Linear(input_shape, ...),
    ...,
    nn.ReLU(),
    ...,
    nn.Linear(..., num_classes),
).to(device)

opt_do = torch.optim.Adam(model_do.parameters(), lr=3e-4)
loss_func_do = nn.CrossEntropyLoss()
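One possible completion of the blanks above, left as a hedged sketch: the hidden size of 256 and p=0.5 are assumptions, not the notebook's official answer.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
input_shape, num_classes = 28 * 28, 10  # assumed FashionMNIST dimensions

model_do = nn.Sequential(
    nn.Flatten(),
    nn.Linear(input_shape, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # dropout after the hidden activation
    nn.Linear(256, num_classes),
).to(device)

opt_do = torch.optim.Adam(model_do.parameters(), lr=3e-4)
loss_func_do = nn.CrossEntropyLoss()
```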
We can see that L2 regularization tends to shrink the larger weights more strongly, whereas L1 regularization pays no attention to the magnitude of the weights: its penalty gradient is the same for every nonzero weight.
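This follows from the gradients of the penalties: for the L2 penalty lam * w**2 the gradient 2 * lam * w grows with the weight, while for the L1 penalty lam * |w| the gradient lam * sign(w) ignores magnitude. A small sketch verifying this with autograd (the weights and lam are arbitrary illustration values):

```python
import torch

w = torch.tensor([0.1, -2.0, 5.0], requires_grad=True)
lam = 0.01

# L2 penalty: gradient 2 * lam * w is proportional to the weight itself
l2 = lam * (w ** 2).sum()
g2, = torch.autograd.grad(l2, w)

# L1 penalty: gradient lam * sign(w) is the same size for every nonzero weight
l1 = lam * w.abs().sum()
g1, = torch.autograd.grad(l1, w)
```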